Parallel Training of CRFs: A Practical Approach to Build Large-Scale Prediction Models for Sequence Data

Authors

  • H. X. Phan
  • M. L. Nguyen
  • S. Horiguchi
  • Y. Inoguchi
  • B. T. Ho

Abstract

Conditional random fields (CRFs) have been successfully applied to a variety of tasks that involve predicting and labeling structured data, such as natural language tagging and parsing, image segmentation and object recognition, and protein secondary structure prediction. The key advantages of CRFs are their ability to encode a variety of overlapping, non-independent features from empirical data and their capacity for global normalization and optimization. However, estimating parameters for CRFs is very time-consuming because of the intensive forward-backward computation needed to evaluate the likelihood function and its gradient during training. This paper presents a high-performance method for training CRFs on massively parallel processing systems that allows us to handle huge datasets with hundreds of thousands of data sequences and millions of features. We performed experiments on an important natural language processing task (phrase chunking) on large-scale corpora and achieved significant results in terms of both reduced computation time and improved prediction accuracy.
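
The abstract does not describe the implementation, so the following is only a minimal sketch of why this kind of training parallelizes, assuming a linear-chain CRF whose scores are already given as per-position label scores (`emit`) and a label-transition matrix (`trans`). All function names here are hypothetical and are not taken from the paper. The log-likelihood of a corpus is a sum of independent per-sequence terms, each requiring a forward pass, so the data can be split into shards whose partial sums are computed separately and then added. Threads stand in for the processors of a parallel machine.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def logsumexp(values):
    # Numerically stable log(sum(exp(v))) over a non-empty list.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emit, trans):
    # Forward pass for one sequence: returns log Z.
    # emit[t][y] is the score of label y at position t;
    # trans[a][b] is the score of moving from label a to label b.
    alpha = list(emit[0])
    n = len(alpha)
    for t in range(1, len(emit)):
        alpha = [
            emit[t][y] + logsumexp([alpha[a] + trans[a][y] for a in range(n)])
            for y in range(n)
        ]
    return logsumexp(alpha)


def path_score(emit, trans, labels):
    # Unnormalized score of one labeled sequence.
    score = emit[0][labels[0]]
    for t in range(1, len(labels)):
        score += trans[labels[t - 1]][labels[t]] + emit[t][labels[t]]
    return score


def sequence_log_likelihood(emit, trans, labels):
    # log p(labels | sequence) = path score - log Z.
    return path_score(emit, trans, labels) - log_partition(emit, trans)


def shard_log_likelihood(shard, trans):
    # Partial sum over the sequences assigned to one worker.
    return sum(sequence_log_likelihood(e, trans, y) for e, y in shard)


def parallel_log_likelihood(data, trans, n_workers):
    # Split the corpus round-robin into n_workers shards, compute each
    # partial sum in its own worker, then add the partial sums.
    shards = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda s: shard_log_likelihood(s, trans), shards))
    return sum(partials)
```

With `data` as a list of `(emit, labels)` pairs, `parallel_log_likelihood(data, trans, n_workers=4)` returns the same value as a serial sum, up to floating-point rounding. The gradient is likewise a sum of per-sequence terms, so it could be accumulated across shards in the same way; that part is omitted here.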


Similar resources

Training Log Linear Models using Smoothed Hamming Loss

In a paper that is to appear in NIPS this year [1], we proposed a new objective function for training Conditional Random Fields (CRFs). When using the traditional log-likelihood training, the training objective function is fundamentally different from the testing objective (the accuracy of the resulting parse). We wanted to develop a method of training CRFs where the training and testing object...


Practical Very Large Scale CRFs

Conditional Random Fields (CRFs) are a widely used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the lab...


A Practical Desalinization Model for Large Scale Application

Salinity of soil and water is the most important agricultural hazard in arid and semi-arid regions. In saline soils, yield is directly influenced by soluble salts in the root zone as well as by shallow water table depth. The first step in reclaiming such soils is reducing salinity to an optimum level by leaching. The objective of this study was to develop a practical model to estimate wat...


Parallelization of Rich Models for Steganalysis of Digital Images using a CUDA-based Approach

There are several different methods for building an efficient strategy for steganalysis of digital images. A very powerful method in this area is the rich model, which consists of a large number of diverse sub-models in both the spatial and transform domains that should be utilized. However, the extraction of various types of features from an image is very time-consuming in some steps, especially for training pha...


Prediction of the waste stabilization pond performance using linear multiple regression and multi-layer perceptron neural network: a case study of Birjand, Iran

Background: Data mining (DM) is an approach used in extracting valuable information from environmental processes. This research depicts a DM approach used in extracting some information from influent and effluent wastewater characteristic data of a waste stabilization pond (WSP) in Birjand, a city in Eastern Iran. Methods: Multiple regression (MR) and neural network (NN) models were examined u...




Publication date: 2007